When first attempting to model time series data, the models usually considered first are autoregressive (AR) and moving average (MA) models. These AR/MA models are used to better understand the data and to forecast future values from historical records. Before using these models, the time series must be stationary, meaning its mean and variance do not change over time. The most common cause of non-stationarity is a trend, where values drift upward or downward over time. So first, to check whether the data is stationary, we can use the ACF (autocorrelation function) and the Dickey-Fuller test. Below are the Dickey-Fuller test and the ACF plots for the coal consumption data explored and visualized in previous sections.
Code
theme_set(theme_gray(base_size = 12, base_family = "Palatino"))
coal_df <- read.csv("data/coal_us_consumption.csv") |>
  group_by(period) |>
  summarise(Consumption = sum(consumption)) |>
  filter(!period %in% c('2000-Q1', '2000-Q2', '2000-Q3', '2000-Q4'))
coal_df_ts <- ts(coal_df$Consumption, start = c(2001, 1), end = c(2021, 1), frequency = 4)
`Log Transformation` <- log(coal_df_ts)
`Remove Seasonality` <- diff(log(coal_df_ts), lag = 4)
`First Diff After Remove Seasonality` <- diff(diff(log(coal_df_ts), lag = 4))
a <- ggAcf(coal_df_ts, 70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`, 70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`, 70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`, 70) + ggtitle("First Diff After Remove Seasonality")
The ACF plots show the differences among the original data, the log transformation, the seasonally differenced series, and the first difference of the seasonally differenced series.
For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.
This ACF plot of the detrended and differenced US total coal consumption data (the fourth one above) shows that almost all the spikes lie inside the blue bounds, which suggests the detrended and differenced data is stationary and ready to be used in further analysis.
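As a concrete illustration of these ±2/√T bounds, here is a minimal base-R sketch on simulated white noise (not part of the report's pipeline; the series and lag count are assumptions for demonstration):

```r
# White-noise bounds: for a series of length T, roughly 95% of sample
# autocorrelations should fall within +/- 2/sqrt(T).
set.seed(42)
wn <- rnorm(200)                                   # simulated white noise, T = 200
T <- length(wn)
bound <- 2 / sqrt(T)                               # the blue dashed lines
r <- acf(wn, lag.max = 40, plot = FALSE)$acf[-1]   # sample ACF, dropping lag 0
mean(abs(r) > bound)                               # share of spikes outside the bounds
```

For genuine white noise this share hovers around 0.05; a substantially larger share signals autocorrelation.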
Stationarity Test & Seasonality Test
Stationarity
Code
adf.test(`First Diff After Remove Seasonality`)
Augmented Dickey-Fuller Test
data: First Diff After Remove Seasonality
Dickey-Fuller = -3.9404, Lag order = 4, p-value = 0.01703
alternative hypothesis: stationary
The p-value of the stationarity test is less than 0.05, which means that first-order differencing combined with removing seasonality makes the time series stationary. The dataset does not need any further differencing or detrending.
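The transformation this conclusion rests on (log, seasonal difference, first difference) can be sketched on simulated data. This is a base-R illustration with an assumed trend and seasonal pattern, not the report's coal series:

```r
# Simulated quarterly series with an exponential trend and quarterly seasonality,
# mimicking the structure of the coal data (all components here are made up).
set.seed(1)
x <- ts(exp(0.01 * (1:84)                          # trend
            + rep(c(0.2, -0.1, 0.3, -0.4), 21)     # quarterly seasonality
            + rnorm(84, sd = 0.05)),               # noise
        start = c(2001, 1), frequency = 4)
y <- diff(diff(log(x), lag = 4))  # log, seasonal difference (lag 4), first difference
round(mean(y), 3)                 # close to zero once trend and seasonality are removed
```

The seasonal difference at lag 4 cancels the quarterly pattern, and the subsequent first difference removes the remaining trend, leaving a series that fluctuates around zero.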
Seasonality
Code
isSeasonal(`First Diff After Remove Seasonality`, test ="combined", freq =NA)
[1] FALSE
The `isSeasonal` function from the `seastests` library shows that our first-differenced and detrended data has no remaining seasonality.
Now we are confident that our differenced and detrended data is ready for the model fitting.
Model Fitting
Code
log(coal_df_ts) |> diff() |> diff(lag = 4) |> ggtsdisplay()  # this is better
Code
#ggplotly(d) plotting the stationary data's ACF graph
The ACF and PACF plots suggest the candidate orders q = 0,1,2,3,4, d = 1, p = 0,1,2,3,4, Q = 0,1,2, P = 0,1,2,3, and D = 1.
Code
SARIMA.c <- function(p1, p2, q1, q2, P1, P2, Q1, Q2, data) {
  d <- 1
  D <- 1
  i <- 1
  ls <- matrix(rep(NA, 9 * 100), nrow = 100)
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          if (p + d + q + P + D + Q <= 9) {
            model <- Arima(data, order = c(p - 1, d, q - 1), seasonal = c(P - 1, D, Q - 1))
            ls[i, ] <- c(p - 1, d, q - 1, P - 1, D, Q - 1, model$aic, model$bic, model$aicc)
            i <- i + 1
          }
        }
      }
    }
  }
  temp <- as.data.frame(ls)
  names(temp) <- c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
  temp
}
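A minimal, self-contained version of this grid search, using base R's `arima()` on simulated quarterly data (the report's `SARIMA.c` uses `forecast::Arima` on the actual series; the series and order ranges below are assumptions for illustration):

```r
# Enumerate small SARIMA orders, fit each with d = D = 1, and rank by AIC.
set.seed(7)
z <- ts(cumsum(rnorm(84)) + rep(c(5, -2, 3, -6), 21), frequency = 4)
grid <- expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1)
grid$AIC <- apply(grid, 1, function(g) {
  fit <- try(arima(z, order = c(g["p"], 1, g["q"]),
                   seasonal = list(order = c(g["P"], 1, g["Q"]), period = 4)),
             silent = TRUE)
  if (inherits(fit, "try-error")) NA_real_ else AIC(fit)  # skip fits that fail
})
grid[which.min(grid$AIC), ]  # candidate orders with the lowest AIC
```

Wrapping each fit in `try()` keeps the search running when a particular order fails to converge, which is a common occurrence in exhaustive SARIMA searches.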
From model fitting, we obtained two candidate models, ARIMA(0,1,0) x (2,1,1) and ARIMA(0,1,0) x (0,1,1); auto.arima() also selected ARIMA(0,1,0) x (2,1,1). ARIMA(0,1,0) x (2,1,1) has the lowest AIC, while ARIMA(0,1,0) x (0,1,1) has the lowest BIC. Now let's run diagnostics on both models to find the better one for forecasting.
ARIMA(0,1,0) x (2,1,1)
Code
fit1 <- Arima(coal_df_ts, order = c(0, 1, 0), seasonal = list(order = c(2, 1, 1), period = 4))
coal_df_ts |>
  Arima(order = c(0, 1, 0), seasonal = list(order = c(2, 1, 1), period = 4)) |>
  residuals() |>
  ggtsdisplay()
Code
checkresiduals(fit1)
Ljung-Box test
data: Residuals from ARIMA(0,1,0)(2,1,1)[4]
Q* = 8.2062, df = 5, p-value = 0.1452
Model df: 3. Total lags used: 8
The Ljung-Box test uses the following hypotheses:
H0: The residuals are independently distributed.
HA: The residuals are not independently distributed; they exhibit serial correlation.
Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.
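The same test can be run directly with base R's `Box.test()`. This is an illustration on simulated data, not the report's fit; `fitdf` is the number of estimated ARMA parameters:

```r
# Ljung-Box test on residuals of a correctly specified AR(1) fit.
set.seed(99)
z <- arima.sim(model = list(ar = 0.5), n = 120)   # simulated AR(1) series
fit <- arima(z, order = c(1, 0, 0))               # fit the matching model
Box.test(residuals(fit), lag = 8, type = "Ljung-Box", fitdf = 1)
# A p-value above 0.05 means we fail to reject independence of the residuals.
```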
There aren't any significant spikes in the ACF, and with p = 0.145 > 0.05 the model passes the Ljung-Box test, so it is suitable for forecasting.
ARIMA(0,1,0) x (0,1,1)
Code
fit2 <- Arima(coal_df_ts, order = c(0, 1, 0), seasonal = list(order = c(0, 1, 1), period = 4))
coal_df_ts |>
  Arima(order = c(0, 1, 0), seasonal = list(order = c(0, 1, 1), period = 4)) |>
  residuals() |>
  ggtsdisplay()
Code
checkresiduals(fit2)
Ljung-Box test
data: Residuals from ARIMA(0,1,0)(0,1,1)[4]
Q* = 16.558, df = 7, p-value = 0.02048
Model df: 1. Total lags used: 8
The Ljung-Box test uses the following hypotheses:
H0: The residuals are independently distributed.
HA: The residuals are not independently distributed; they exhibit serial correlation.
Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.
There are two significant spikes in the ACF, and with p = 0.020 < 0.05 the model fails the Ljung-Box test. The model can still be used for forecasting, but its residuals show some remaining serial correlation.
Model Fitting
Code
coal_df_ts |>Arima(order=c(0,1,0), seasonal =list(order =c(0,1,1), period =4))
From the model diagnostics above, ARIMA(0,1,0) x (2,1,1) is also the model auto.arima() produced, and it has fewer significant spikes in the ACF and PACF plots of its residuals. It also has the smaller sigma squared (residual variance), and estimators with smaller variance are more efficient.
Therefore, ARIMA(0,1,0) x (2,1,1) is the better model.
Forecasting
Let’s forecast for the next three years using the model we just selected.
values <- append(coal_df$Consumption, rep(NA, 11))
coal_ts_df <- data.frame(Year = seq(as.Date("2000/1/1"), by = "quarter", length.out = 92),
                         Short.tons = values)
coal_ts_df$meanf  <- append(rep(NA, 80), meanf(coal_df_ts, h = 12)$mean)
coal_ts_df$naive  <- append(rep(NA, 80), naive(coal_df_ts, h = 12)$mean)
coal_ts_df$snaive <- append(rep(NA, 80), snaive(coal_df_ts, h = 12)$mean)
coal_ts_df$rwf    <- append(rep(NA, 80), rwf(coal_df_ts, h = 12, drift = TRUE)$mean)
coal_ts_df$fit    <- append(rep(NA, 80), forecast(fit1, h = 12)$mean)
p <- ggplot(coal_ts_df) +
  geom_line(aes(x = Year, y = Short.tons)) +
  geom_line(aes(x = Year, y = meanf, color = "Mean")) +
  geom_line(aes(x = Year, y = naive, color = "Naïve")) +
  geom_line(aes(x = Year, y = snaive, color = "SNaïve")) +
  geom_line(aes(x = Year, y = rwf, color = "Drift")) +
  geom_line(aes(x = Year, y = fit, color = "Model")) +
  ggtitle("Comparison of the Fitted Model and Benchmark Methods") +
  ylab("Short tons")
ggplotly(p)
Benchmark methods help data scientists keep their sanity when building models: they set a baseline, a score that the model must outperform. Normally the state of the art is used as the baseline, but for problems with no existing solutions one should build one's own baseline.
Average Method
This method simply takes the average (or "mean") of the entire historical data and uses it to forecast future values. It is very useful for data with small variance or whose values lie close to the mean.
Drift Method
Drift is the amount of change observed in the data. In this method, the drift is set to the average change seen across the whole historical record and is used to forecast future values. Effectively, this means drawing a straight line through the first and last values and extending that line into the future. This method works well on data that follows a general trend over time.
Naïve Method
This method uses the most recent observation as the forecast for every future time step. The assumption is that tomorrow's value equals today's value.
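The three benchmark methods above can each be written in a few lines of base R. This is a sketch for intuition; the report itself uses `meanf()`, `naive()`, and `rwf(..., drift = TRUE)` from the forecast package:

```r
mean_forecast  <- function(y, h) rep(mean(y), h)        # average method
naive_forecast <- function(y, h) rep(y[length(y)], h)   # naive method
drift_forecast <- function(y, h) {                      # drift method
  slope <- (y[length(y)] - y[1]) / (length(y) - 1)      # average historical change
  y[length(y)] + slope * seq_len(h)
}
y <- c(2, 4, 7, 8)
mean_forecast(y, 2)   # -> 5.25 5.25
naive_forecast(y, 2)  # -> 8 8
drift_forecast(y, 2)  # -> 10 12
```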
Comparison Conclusion
The fitted model outperformed the benchmark methods: the mean, naïve, seasonal naïve, and drift methods.
Modeling U.S. Total CO2 Emissions Data
When first attempting to model time series data, the models usually considered first are autoregressive (AR) and moving average (MA) models. These AR/MA models are used to better understand the data and to forecast future values from historical records. Before using these models, the time series must be stationary, meaning its mean and variance do not change over time. The most common cause of non-stationarity is a trend, where values drift upward or downward over time. So first, to check whether the data is stationary, we can use the ACF (autocorrelation function) and the Dickey-Fuller test. Below are the Dickey-Fuller test and the ACF plots for the CO2 emissions data explored and visualized in previous sections.
Code
theme_set(theme_gray(base_size = 12, base_family = "Palatino"))
emissions_df <- read.csv("data/df_total_monthly_CO2_emissions.csv", skip = 1, row.names = 1)
emissions_ts <- ts(emissions_df$sum, start = c(1973, 1), end = c(2022, 12), frequency = 12)
`Log Transformation` <- log(emissions_ts)
`Remove Seasonality` <- diff(log(emissions_ts), lag = 12)
`First Diff After Remove Seasonality` <- diff(diff(log(emissions_ts), lag = 12))
a <- ggAcf(emissions_ts, 70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`, 70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`, 70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`, 70) + ggtitle("First Diff After Remove Seasonality")
The ACF plots show the differences among the original data, the log transformation, the seasonally differenced series, and the first difference of the seasonally differenced series.
For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.
This ACF plot of the detrended and differenced US total CO2 emissions data (the fourth one above) shows that almost all the spikes lie inside the blue bounds, which suggests the detrended and differenced data is stationary and ready to be used in further analysis.
Stationarity Test & Seasonality Test
Stationarity
Code
adf.test(`First Diff After Remove Seasonality`)
Warning in adf.test(`First Diff After Remove Seasonality`): p-value smaller than
printed p-value
Augmented Dickey-Fuller Test
data: First Diff After Remove Seasonality
Dickey-Fuller = -10.227, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
The p-value of the stationarity test is less than 0.05, which means that first-order differencing combined with removing seasonality makes the time series stationary. The dataset does not need any further differencing or detrending.
Seasonality
Code
isSeasonal(`First Diff After Remove Seasonality`, test ="combined", freq =NA)
[1] FALSE
The `isSeasonal` function from the `seastests` library shows that our first-differenced and detrended data has no remaining seasonality.
Now we are confident that our differenced and detrended data is ready for the model fitting.
Model Fitting
Code
log(emissions_ts) |> diff() |> diff(lag = 12) |> ggtsdisplay()  # this is better
Code
#ggplotly(d) plotting the stationary data's ACF graph
The ACF and PACF plots suggest the candidate orders q = 0,1,2, d = 1,2, p = 0,1,2, Q = 0,1,2,3,4, P = 0,1,2,3,4, and D = 1,2.
Code
SARIMA.c <- function(p1, p2, q1, q2, P1, P2, Q1, Q2, d, D, data) {
  i <- 1
  ls <- matrix(rep(NA, 9 * 100), nrow = 100)
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          if (p + d + q + P + D + Q <= 9) {
            model <- Arima(data, order = c(p - 1, d, q - 1), seasonal = c(P - 1, D, Q - 1))
            ls[i, ] <- c(p - 1, d, q - 1, P - 1, D, Q - 1, model$aic, model$bic, model$aicc)
            i <- i + 1
          }
        }
      }
    }
  }
  temp <- as.data.frame(ls)
  names(temp) <- c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
  temp
}
Code
# q = 0,1,2, d = 1, p = 0,1,2, Q = 0,1,2,3,4, P = 0,1,2,3,4, and D = 1
output1 <- SARIMA.c(p1 = 1, p2 = 3, q1 = 1, q2 = 3, P1 = 1, P2 = 4, Q1 = 1, Q2 = 4,
                    d = 1, D = 1, data = emissions_ts) |> drop_na()
minaic <- output1[which.min(output1$AIC), ]
minbic <- output1[which.min(output1$BIC), ]
From model fitting, we obtained one model, ARIMA(1,1,1) x (0,1,1), while auto.arima() produced ARIMA(1,1,1) x (2,1,2). ARIMA(1,1,1) x (2,1,2) has the lowest AIC, BIC, and AICc. Now let's run diagnostics on both models to find the better one for forecasting.
ARIMA(1,1,1) x (2,1,2)
Code
fit1 <- Arima(emissions_ts, order = c(1, 1, 1), seasonal = list(order = c(2, 1, 2), period = 12))
emissions_ts |>
  Arima(order = c(1, 1, 1), seasonal = list(order = c(2, 1, 2), period = 12)) |>
  residuals() |>
  ggtsdisplay()
Code
checkresiduals(fit1)
Ljung-Box test
data: Residuals from ARIMA(1,1,1)(2,1,2)[12]
Q* = 26.72, df = 18, p-value = 0.0844
Model df: 6. Total lags used: 24
The Ljung-Box test uses the following hypotheses:
H0: The residuals are independently distributed.
HA: The residuals are not independently distributed; they exhibit serial correlation.
Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.
There is one significant spike in the ACF, and with p = 0.084 > 0.05 the model passes the Ljung-Box test, so it is suitable for forecasting.
ARIMA(1,1,1) x (0,1,1)
Series: emissions_ts
ARIMA(1,1,1)(0,1,1)[12]
Coefficients:
ar1 ma1 sma1
0.4287 -0.8131 -0.8291
s.e. 0.0727 0.0497 0.0218
sigma^2 = 307.2: log likelihood = -2519.7
AIC=5047.41 AICc=5047.48 BIC=5064.91
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.04531625 17.29221 12.97933 -0.06676721 2.213159 0.5871565
ACF1
Training set -0.00373604
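For reference, the training-set error measures printed above are computed from the residuals roughly as follows (a base-R sketch with made-up numbers; MASE additionally scales the errors by a naïve benchmark and is omitted here):

```r
accuracy_measures <- function(actual, fitted) {
  e <- actual - fitted
  c(ME   = mean(e),                       # mean error
    RMSE = sqrt(mean(e^2)),               # root mean squared error
    MAE  = mean(abs(e)),                  # mean absolute error
    MAPE = mean(abs(e / actual)) * 100)   # mean absolute percentage error
}
round(accuracy_measures(actual = c(100, 110, 105), fitted = c(98, 112, 104)), 3)
```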
Model Selection
From the model diagnostics above, ARIMA(1,1,1) x (2,1,2) is also the model auto.arima() produced, and it has fewer significant spikes in the ACF and PACF plots of its residuals. It also has the smaller sigma squared (residual variance), and estimators with smaller variance are more efficient. Moreover, all the training-set error measures, especially RMSE, of the first model are lower than those of the second model.
Therefore, ARIMA(1,1,1) x (2,1,2) is the better model.
Forecasting
Let's forecast the next five years using the model we just selected.
forecast() Method
Code
a <- fit1 |>
  forecast(h = 80) |>
  autoplotly() +
  ggtitle("Monthly CO2 Emissions Five-Year-Forecasting") +
  xlab("Year") +
  ylab("Million Metric Tons of Carbon Dioxide")
a %>% layout(showlegend = F, title = 'Time Series with Rangeslider',
             xaxis = list(rangeslider = list(visible = T)))
sarima.for() Method
Code
sarima.for(emissions_ts, 60, 1,1,1,2,1,2,12)
$pred
Jan Feb Mar Apr May Jun Jul Aug
2023 618.7850 522.7436 523.7467 456.6446 484.6787 525.0852 592.4839 588.2336
2024 595.4889 510.3660 509.0163 439.2943 467.0804 511.1400 581.6272 578.6118
2025 581.2447 504.6408 496.4980 429.0742 455.7201 501.7074 571.3342 569.9160
2026 574.3190 499.8097 488.2619 422.9227 448.9569 495.0954 563.3664 562.5627
2027 569.2822 493.6951 481.9732 417.2486 443.2297 488.9319 556.6300 555.7564
Sep Oct Nov Dec
2023 518.0627 499.4276 511.9846 573.6319
2024 507.6640 490.8693 500.3659 555.9238
2025 496.9108 480.3016 490.3520 544.5836
2026 488.5437 471.5427 482.7586 537.8624
2027 481.6952 464.4100 476.1690 532.1546
$se
Jan Feb Mar Apr May Jun Jul Aug
2023 16.98534 19.86316 21.27671 22.26454 23.09347 23.85051 24.56782 25.25844
2024 29.43616 30.53130 31.44341 32.27622 33.06763 33.83262 34.57761 35.30568
2025 38.73656 39.38616 40.02545 40.65479 41.27458 41.88521 42.48707 43.08052
2026 46.16912 46.85184 47.48856 48.10286 48.70392 49.29551 49.87923 50.45586
2027 53.80342 54.64802 55.39405 56.09763 56.77985 57.44901 58.10852 58.75985
Sep Oct Nov Dec
2023 25.92822 26.58017 27.21614 27.83744
2024 36.01857 36.71743 37.40316 38.07652
2025 43.66591 44.24356 44.81376 45.37679
2026 51.02584 51.58947 52.14699 52.69860
2027 59.40374 60.04060 60.67073 61.29436
Compare with benchmark method
Code
values <- append(emissions_df$sum, rep(NA, 80))
emissions_ts_df <- data.frame(Year = seq(as.Date("1973/1/1"), by = "month", length.out = 680),
                              Million.metric.tons = values)
emissions_ts_df$meanf  <- append(rep(NA, 600), meanf(emissions_ts, h = 80)$mean)
emissions_ts_df$naive  <- append(rep(NA, 600), naive(emissions_ts, h = 80)$mean)
emissions_ts_df$snaive <- append(rep(NA, 600), snaive(emissions_ts, h = 80)$mean)
emissions_ts_df$rwf    <- append(rep(NA, 600), rwf(emissions_ts, h = 80, drift = TRUE)$mean)
emissions_ts_df$fit    <- append(rep(NA, 600), forecast(fit1, h = 80)$mean)
p <- ggplot(emissions_ts_df) +
  geom_line(aes(x = Year, y = Million.metric.tons)) +
  geom_line(aes(x = Year, y = meanf, color = "Mean")) +
  geom_line(aes(x = Year, y = naive, color = "Naïve")) +
  geom_line(aes(x = Year, y = snaive, color = "SNaïve")) +
  geom_line(aes(x = Year, y = rwf, color = "Drift")) +
  geom_line(aes(x = Year, y = fit, color = "Model")) +
  ggtitle("Comparison of the Fitted Model and Benchmark Methods") +
  ylab("Million Metric Tons of Carbon Dioxide")
ggplotly(p)
Benchmark methods help data scientists keep their sanity when building models: they set a baseline, a score that the model must outperform. Normally the state of the art is used as the baseline, but for problems with no existing solutions one should build one's own baseline.
Average Method
This method simply takes the average (or "mean") of the entire historical data and uses it to forecast future values. It is very useful for data with small variance or whose values lie close to the mean.
Drift Method
Drift is the amount of change observed in the data. In this method, the drift is set to the average change seen across the whole historical record and is used to forecast future values. Effectively, this means drawing a straight line through the first and last values and extending that line into the future. This method works well on data that follows a general trend over time.
Naïve Method
This method uses the most recent observation as the forecast for every future time step. The assumption is that tomorrow's value equals today's value.
Comparison Conclusion
The fitted model outperformed the benchmark methods: the mean, naïve, seasonal naïve, and drift methods.
Source Code
---title: "ARMA/ARIMA/SARIMA Models"format: html: embed-resources: true self-contained: true code-fold: true code-copy: true code-tools: true code-overflow: wrap---# Modeling U.S. Total Coal Consumption Data```{r,echo=FALSE,message=FALSE,warning=FALSE}library(tidyverse)library(ggplot2)library(forecast)library(astsa) library(xts)library(tseries)library(fpp2)library(fma)library(lubridate)library(tidyverse)library(TSstudio)library(quantmod)library(tidyquant)library(plotly)library(ggplot2)library(ggfortify)library(autoplotly)library(ggpubr)library(seastests)```When first attempting to model time series data, the first model considered is an **autoregressive** or **moving average model**. These AR/MA models are used to understand better the data and forecast future values based on historical data recorded previously. Before using these models, the time series must be stationary, meaning the mean and variance do not change over time. The most often cause of a violation of stationarity is a trend, where values slowly increase over time. So first, to check whether the data is stationary, we can use an ACF (autocorrelation function) and Dickey-Fuller Test to test for stationarity. 
Below is the Dickey-Fuller Test and the ACF plot for the coal consumption data explored and visualized in previous sections.| <br>```{r}theme_set(theme_gray(base_size=12,base_family="Palatino"))coal_df <-read.csv("data/coal_us_consumption.csv") |>group_by(period) |>summarise(Consumption =sum(consumption)) |>filter(!period %in%c('2000-Q1', '2000-Q2', '2000-Q3', '2000-Q4'))coal_df_ts <-ts(coal_df$Consumption, start =c(2001,1), end =c(2021,1), frequency =4)`Log Transformation`<-log(coal_df_ts)`Remove Seasonality`<-diff(log(coal_df_ts), lag=4)`First Diff After Remove Seasonality`<-diff(diff(log(coal_df_ts), lag=4))a <-ggAcf(coal_df_ts,70) +ggtitle("Original Data")b <-ggAcf(`Log Transformation`,70) +ggtitle("Log Transformation")c <-ggAcf(`Remove Seasonality`,70) +ggtitle("Remove Seasonality")d <-ggAcf(`First Diff After Remove Seasonality`,70) +ggtitle("First Diff After Remove Seasonality")```## ACF Plots::: panel-tabset## Original and Log Transformation```{r}ggarrange(a, b, ncol =1, nrow =2)```## Detrend and Remove Seasonality```{r}ggarrange(c, d, ncol =1, nrow =2)```:::The ACF plots show the differences among original data, log transformation, remove seasonality, and first difference with remove seasonality.For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). 
If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it's not stationary either.This ACF graph of detrended and differenced US total coal consumption data (the forth one from above) shows almost all the spikes are inside the blue bounds, which proves the detrended and differenced data is stationary and ready to be utilized in future analysis.| <br>## Stationarity Test & Seasonality Test### Stationarity```{r}adf.test(`First Diff After Remove Seasonality`)```Stationarity test is less than 0.05, which means first order differencing and removing seasonality would make the time series dataset stationary. The dataset is no longer needed to be differenced or detrended.### Seasonality| <br>```{r}isSeasonal(`First Diff After Remove Seasonality`, test ="combined", freq =NA)```The `isSeasonal` function from `seastests` library shows our first-differenced and detrended data has no seasonality.Now we are confident that our differenced and detrended data is ready for the model fitting.| <br><br>## Model Fitting```{r}log(coal_df_ts) |>diff() |>diff(lag=4) |>ggtsdisplay() #this is better#ggplotly(d) plotting the stationary data's ACF graph```The ACF and PACF graphs are showing that q = 0,1,2,3,4, d = 1, p = 0,1,2,3,4, Q = 0,1,2, P = 0,1,2,3, and D = 1```{r}SARIMA.c=function(p1,p2,q1,q2,P1,P2,Q1,Q2,data){#K=(p2+1)*(q2+1)*(P2+1)*(Q2+1) temp=c() d=1 D=1 s=12 i=1 temp=data.frame() ls=matrix(rep(NA,9*100),nrow=100)for (p in p1:p2) {for(q in q1:q2) {for(P in P1:P2) {for(Q in Q1:Q2) {if(p+d+q+P+D+Q<=9) { model<-Arima(data,order=c(p-1,d,q-1),seasonal=c(P-1,D,Q-1)) ls[i,]=c(p-1,d,q-1,P-1,D,Q-1,model$aic,model$bic,model$aicc) i=i+1#print(i) } } } } } temp=as.data.frame(ls)names(temp)=c("p","d","q","P","D","Q","AIC","BIC","AICc") temp}``````{r, warning=FALSE}output <-SARIMA.c(p1 =1, p2 =5, q1 =1, q2 =5, P1 =1, P2 =4, Q1 =1, Q2 =3, data = coal_df_ts) |>drop_na()minaic <- 
output[which.min(output$AIC), ]minbic <- output[which.min(output$BIC), ]```::: panel-tabset## Parameters with Minimum AIC```{r}knitr::kable(minaic)```## Parameters with Minimum BIC```{r}knitr::kable(minbic)```## Using `auto.arima()````{r}auto.arima(coal_df_ts)```:::| <br>## Model diagnosticsFrom model fitting, we generated two models, ARIMA(0,1,0) x (2,1,1) and ARIMA(0,1,0) x (0,1,1). `auto.arima()` generated ARIMA(0,1,0) x (2,1,1) as well. ARIMA(0,1,0) x (2,1,1) has the lowest AIC while ARIMA(0,1,0) x (0,1,1) has the lowest BIC. Now let's do two model diagnoses to analyze the result and find the better model to do forecast later.### ARIMA(0,1,0) x (2,1,1)```{r}fit1 <-Arima(coal_df_ts, order=c(0,1,0), seasonal =list(order =c(2,1,1), period =4))```#### Model Fitting Visual Results and Residuals```{r}set.seed(123)model_output <-capture.output(sarima(coal_df_ts,0,1,0,2,1,1,4))``````{r}coal_df_ts |>Arima(order=c(0,1,0), seasonal =list(order =c(2,1,1), period =4)) |>residuals() |>ggtsdisplay()``````{r}checkresiduals(fit1)```*The Ljung-Box test uses the following hypotheses:**H~0~: The residuals are independently distributed.**H~A~: The residuals are not independently distributed; they exhibit serial correlation.**Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.*> <br>There isn't any significant spikes in the ACF, and the model fails the Ljung-Box test. 
The model can still be used for forecasting.#### Model Fitting```{r}coal_df_ts |>Arima(order=c(0,1,0), seasonal =list(order=c(2,1,1), period=4))```#### Model Output Diagnostics```{r}cat(model_output[25:56], model_output[length(model_output)], sep ="\n") ```### ARIMA(0,1,0) x (0,1,1)```{r}fit2 <-Arima(coal_df_ts, order=c(0,1,0), seasonal =list(order =c(0,1,1), period =4))```#### Model Fitting Visual Results and Residuals```{r}set.seed(123)model_output2 <-capture.output(sarima(coal_df_ts,0,1,0,0,1,1,4))``````{r}coal_df_ts |>Arima(order=c(0,1,0), seasonal =list(order =c(0,1,1), period =4)) |>residuals() |>ggtsdisplay()``````{r}checkresiduals(fit2)```*The Ljung-Box test uses the following hypotheses:**H~0~: The residuals are independently distributed.**H~A~: The residuals are not independently distributed; they exhibit serial correlation.**Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05 because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.*> <br>There are two significant spikes in the ACF, and the model fails the Ljung-Box test. The model can still be used for forecasting.#### Model Fitting```{r}coal_df_ts |>Arima(order=c(0,1,0), seasonal =list(order =c(0,1,1), period =4))```#### Model Output Diagnostics```{r}cat(model_output2[20:49], model_output2[length(model_output2)], sep ="\n")```| <br>### Model SelectionFrom model diagnostics above, ARIMA(0,1,0) x (2,1,1) is also the one `auto.arima()` produced, and it has less spikes in the ACF and PACF plots of its residuals. 
The model also has smaller sigma squared means it has smaller variance, which means the estimators with a smaller variance is more efficient.Therefore, ARIMA(0,1,0) x (2,1,1) is the better model.| <br>## ForecastingLet's forecast for the next three years using the model we just selected.### `forecast` Method```{r}c <- fit1 |>forecast(h=12) |>autoplotly() +ggtitle("Quarterly Coal Consumption Three-Year-Forecasting") +xlab("Year") +ylab("Short Ton")c |>layout(showlegend = F,xaxis =list(rangeslider =list(visible = T)))```### `sarima.for()` Method```{r}sarima.for(coal_df_ts, 12, 0,1,0,2,1,1,4)```| <br>## Compare with benchmark method```{r, warning=FALSE}values <-append(coal_df$Consumption, rep(NA,11))coal_ts_df <-data.frame(Year =seq(as.Date("2000/1/1"), by ="quarter", length.out =92), Short.tons=values)coal_ts_df$meanf <-append(rep(NA,80), meanf(coal_df_ts, h=12)$mean)coal_ts_df$naive <-append(rep(NA,80), naive(coal_df_ts, h=12)$mean)coal_ts_df$snaive <-append(rep(NA,80), snaive(coal_df_ts, h=12)$mean)coal_ts_df$rwf <-append(rep(NA,80), rwf(coal_df_ts, h=12, drift=TRUE)$mean)coal_ts_df$fit <-append(rep(NA,80), forecast(fit1, h=12)$mean)p <-ggplot(coal_ts_df) +geom_line(aes(x=Year, y = Short.tons)) +geom_line(aes(x=Year, y = meanf, color ="Mean")) +geom_line(aes(x=Year, y = naive, color ="Naïve")) +geom_line(aes(x=Year, y = snaive, color ="SNaïve")) +geom_line(aes(x=Year, y = rwf, color ="Drift")) +geom_line(aes(x=Year, y = fit, color ="Model")) +ggtitle("Comparison of the Fitted Model and Benchmark Methods") +ylab("Short tons")ggplotly(p)```Benchmark method is for data scientists to keep their sanity when building models, they set a baseline --- a score that the model must outperform. Normally, the state-of-the-art is used as the baseline but for problems with no existing solutions yet, one should build their own baseline.### Average MethodThis method simply takes the average (or "mean") value of the entire historical data and use that to forecast future values. 
Very useful for data with small variance or whose value lies close to the mean.### Drift MethodDrift is the amount of change observed from the data. In this method, drift is set to be the average change seen in the whole historical data and uses that to forecast values in the future. Basically, this just means drawing a straight line using the first and last values and extend that line into the future. This method works well on data that follows a general trend over time.### Naïve MethodThis method uses the most recent value as the forecasted value for the next time step. The assumption followed by this method is that its value tomorrow is equal to its value today.### Comparison ConclusionThe fitted model has outperformed the rest of the benchmark methods such us the mean method, naive methods, and drift method.# Modeling U.S. Total CO2 Emissions DataWhen first attempting to model time series data, the first model considered is an **autoregressive** or **moving average model**. These AR/MA models are used to understand better the data and forecast future values based on historical data recorded previously. Before using these models, the time series must be stationary, meaning the mean and variance do not change over time. The most often cause of a violation of stationarity is a trend, where values slowly increase over time. So first, to check whether the data is stationary, we can use an ACF (autocorrelation function) and Dickey-Fuller Test to test for stationarity. 
Below is the Dickey-Fuller Test and the ACF plot for the CO2 emissions data explored and visualized in previous sections.> <br>```{r}theme_set(theme_gray(base_size=12,base_family="Palatino"))emissions_df <-read.csv("data/df_total_monthly_CO2_emissions.csv", skip=1, row.names =1)emissions_ts <-ts(emissions_df$sum, start =c(1973,1), end =c(2022,12), frequency =12)`Log Transformation`<-log(emissions_ts)`Remove Seasonality`<-diff(log(emissions_ts), lag=12)`First Diff After Remove Seasonality`<-diff(diff(log(emissions_ts), lag=12))a <-ggAcf(emissions_ts,70) +ggtitle("Original Data")b <-ggAcf(`Log Transformation`,70) +ggtitle("Log Transformation")c <-ggAcf(`Remove Seasonality`,70) +ggtitle("Remove Seasonality")d <-ggAcf(`First Diff After Remove Seasonality`,70) +ggtitle("First Diff After Remove Seasonality")```## ACF Plots::: panel-tabset## Original and Log Transformation```{r}ggarrange(a, b, ncol =1, nrow =2)```## Detrend and Remove Seasonality```{r}ggarrange(c, d, ncol =1, nrow =2)```:::The ACF plots show the differences among original data, log transformation, remove seasonality, and first difference with remove seasonality.For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). 
If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it is not stationary either.

The ACF graph of the detrended and differenced U.S. total CO2 emissions data (the fourth one above) shows that almost all the spikes are inside the blue bounds, which indicates the detrended and differenced data is stationary and ready to be used in the analysis that follows.

| <br>

## Stationarity Test & Seasonality Test

### Stationarity

```{r}
adf.test(`First Diff After Remove Seasonality`)
```

The p-value of the augmented Dickey-Fuller test is less than 0.05, which means that first-order differencing combined with seasonal differencing makes the time series stationary. The dataset does not need any further differencing or detrending.

### Seasonality

| <br>

```{r}
isSeasonal(`First Diff After Remove Seasonality`, test = "combined", freq = NA)
```

The `isSeasonal` function from the `seastests` library shows that our first-differenced and detrended data has no remaining seasonality.

Now we are confident that our differenced and detrended data is ready for model fitting.

| <br><br>

## Model Fitting

```{r}
log(emissions_ts) |>
  diff() |>
  diff(lag = 12) |>
  ggtsdisplay()   # ACF and PACF of the stationary series
```

The ACF and PACF graphs suggest the candidate parameters p = 0, 1, 2; d = 1, 2; q = 0, 1, 2; P = 0, 1, 2, 3, 4; D = 1, 2; and Q = 0, 1, 2, 3, 4.

```{r}
# Grid search over SARIMA orders, recording AIC, BIC, and AICc for each fit
SARIMA.c <- function(p1, p2, q1, q2, P1, P2, Q1, Q2, d, D, data) {
  i <- 1
  ls <- matrix(rep(NA, 9 * 100), nrow = 100)
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          if (p + d + q + P + D + Q <= 9) {
            model <- Arima(data, order = c(p - 1, d, q - 1), seasonal = c(P - 1, D, Q - 1))
            ls[i, ] <- c(p - 1, d, q - 1, P - 1, D, Q - 1, model$aic, model$bic, model$aicc)
            i <- i + 1
          }
        }
      }
    }
  }
  temp <- as.data.frame(ls)
  names(temp) <- c("p", "d", "q", "P", "D", "Q", "AIC", "BIC", "AICc")
  temp
}
```

```{r, warning=FALSE}
# p = 0,1,2; q = 0,1,2; P = 0,1,2,3; Q = 0,1,2,3; d = 1; D = 1
output1 <- SARIMA.c(p1 = 1, p2 = 3, q1 = 1, q2 = 3,
                    P1 = 1, P2 = 4, Q1 = 1, Q2 = 4,
                    d = 1, D = 1, data = emissions_ts) |>
  drop_na()

minaic <- output1[which.min(output1$AIC), ]
minbic <- output1[which.min(output1$BIC), ]
```

::: panel-tabset
## Parameters with Minimum AIC

```{r}
knitr::kable(minaic)
```

## Parameters with Minimum BIC

```{r}
knitr::kable(minbic)
```

## Using `auto.arima()`

```{r}
auto.arima(emissions_ts)
```
:::

| <br>

## Model Diagnostics

From the model fitting, the grid search produced one model, ARIMA(1,1,1) x (0,1,1). `auto.arima()` produced ARIMA(1,1,1) x (2,1,2). It looks like ARIMA(1,1,1) x (2,1,2) has the lowest AIC, BIC, and AICc. Now let's run diagnostics on both models to analyze the results and find the better model for forecasting.

### ARIMA(1,1,1) x (2,1,2)

```{r}
fit1 <- Arima(emissions_ts, order = c(1,1,1), seasonal = list(order = c(2,1,2), period = 12))
```

#### Model Fitting Visual Results and Residuals

```{r}
set.seed(123)
model_output <- capture.output(sarima(emissions_ts, 1,1,1, 2,1,2, 12))
```

```{r}
emissions_ts |>
  Arima(order = c(1,1,1), seasonal = list(order = c(2,1,2), period = 12)) |>
  residuals() |>
  ggtsdisplay()
```

```{r}
checkresiduals(fit1)
```

*The Ljung-Box test uses the following hypotheses:*

*H~0~: The residuals are independently distributed.*

*H~A~: The residuals are not independently distributed; they exhibit serial correlation.*

*Ideally, we would like to fail to reject the null hypothesis. That is, we would like to see the p-value of the test be greater than 0.05, because this means the residuals for our time series model are independent, which is often an assumption we make when creating a model.*

> <br>

There is one significant spike in the ACF, but the model passes the Ljung-Box test.
The model can still be used for forecasting.

#### Model Fitting

```{r}
emissions_ts |>
  Arima(order = c(1,1,1), seasonal = list(order = c(2,1,2), period = 12))
```

#### Model Output Diagnostics

```{r}
cat(model_output[52:86], model_output[length(model_output)], sep = "\n")
```

```{r}
summary(fit1)
```

### ARIMA(1,1,1) x (0,1,1)

```{r}
fit2 <- Arima(emissions_ts, order = c(1,1,1), seasonal = list(order = c(0,1,1), period = 12))
```

#### Model Fitting Visual Results and Residuals

```{r}
set.seed(123)
model_output2 <- capture.output(sarima(emissions_ts, 1,1,1, 0,1,1, 12))
```

```{r}
emissions_ts |>
  Arima(order = c(1,1,1), seasonal = list(order = c(0,1,1), period = 12)) |>
  residuals() |>
  ggtsdisplay()
```

```{r}
checkresiduals(fit2)
```

There are two significant spikes in the ACF, and the model fails the Ljung-Box test. This model should not be used for forecasting.

#### Model Fitting

```{r}
emissions_ts |>
  Arima(order = c(1,1,1), seasonal = list(order = c(0,1,1), period = 12))
```

#### Model Output Diagnostics

```{r}
cat(model_output2[35:66], model_output2[length(model_output2)], sep = "\n")
```

```{r}
summary(fit2)
```

| <br>

### Model Selection

From the model diagnostics above, ARIMA(1,1,1) x (2,1,2), which is also the model `auto.arima()` produced, has fewer significant spikes in the ACF and PACF plots of its residuals. It also has a smaller sigma squared, meaning a smaller residual variance, and an estimator with a smaller variance is more efficient.
Furthermore, all the training-set error measures, especially RMSE, of the first model are lower than those of the second model. Therefore, ARIMA(1,1,1) x (2,1,2) is the better model.

| <br>

## Forecasting

Let's forecast the coming years using the model we just selected.

### `forecast` Method

```{r}
a <- fit1 |>
  forecast(h = 80) |>
  autoplotly() +
  ggtitle("Monthly CO2 Emissions Forecast") +
  xlab("Year") +
  ylab("Million Metric Tons of Carbon Dioxide")

a %>% layout(showlegend = F, title = 'Time Series with Rangeslider',
             xaxis = list(rangeslider = list(visible = T)))
```

### `sarima.for()` Method

```{r}
sarima.for(emissions_ts, 60, 1,1,1, 2,1,2, 12)
```

| <br>

## Compare with Benchmark Methods

```{r, warning=FALSE}
values <- append(emissions_df$sum, rep(NA, 80))
emissions_ts_df <- data.frame(Year = seq(as.Date("1973/1/1"), by = "month", length.out = 680),
                              Million.metric.tons = values)

emissions_ts_df$meanf  <- append(rep(NA, 600), meanf(emissions_ts, h = 80)$mean)
emissions_ts_df$naive  <- append(rep(NA, 600), naive(emissions_ts, h = 80)$mean)
emissions_ts_df$snaive <- append(rep(NA, 600), snaive(emissions_ts, h = 80)$mean)
emissions_ts_df$rwf    <- append(rep(NA, 600), rwf(emissions_ts, h = 80, drift = TRUE)$mean)
emissions_ts_df$fit    <- append(rep(NA, 600), forecast(fit1, h = 80)$mean)

p <- ggplot(emissions_ts_df) +
  geom_line(aes(x = Year, y = Million.metric.tons)) +
  geom_line(aes(x = Year, y = meanf, color = "Mean")) +
  geom_line(aes(x = Year, y = naive, color = "Naïve")) +
  geom_line(aes(x = Year, y = snaive, color = "SNaïve")) +
  geom_line(aes(x = Year, y = rwf, color = "Drift")) +
  geom_line(aes(x = Year, y = fit, color = "Model")) +
  ggtitle("Comparison of the Fitted Model and Benchmark Methods") +
  ylab("Million Metric Tons of Carbon Dioxide")

ggplotly(p)
```

Benchmark methods help data scientists keep their sanity when building models: they set a baseline --- a score that the model must outperform.
Normally, the state of the art is used as the baseline, but for problems with no existing solutions yet, one should build one's own baseline.

### Average Method

This method simply takes the average (or "mean") value of the entire historical data and uses it to forecast future values. Very useful for data with small variance or whose values lie close to the mean.

### Drift Method

Drift is the amount of change observed in the data. In this method, the drift is set to the average change seen across the whole historical record and is used to forecast future values. In effect, this just means drawing a straight line through the first and last values and extending that line into the future. This method works well on data that follows a general trend over time.

### Naïve Method

This method uses the most recent value as the forecasted value for the next time step. The assumption behind this method is that tomorrow's value will equal today's value.

### Comparison Conclusion

The fitted model has outperformed the benchmark methods such as the mean, naïve, and drift methods.
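To close, the three simple benchmarks described above can be reproduced by hand on a toy series (illustrative numbers only; the comparison plot itself uses `meanf()`, `naive()`, `snaive()`, and `rwf()` from the `forecast` package):

```{r}
# Hand-rolled benchmark forecasts on a toy series (illustration only).
y <- c(10, 12, 11, 13, 15, 14)
n <- length(y)
h <- 3   # forecast three steps ahead

mean_fc  <- rep(mean(y), h)                          # average method: 12.5, 12.5, 12.5
naive_fc <- rep(y[n], h)                             # naive method: last value, 14, 14, 14
drift_fc <- y[n] + (1:h) * (y[n] - y[1]) / (n - 1)   # drift method: 14.8, 15.6, 16.4
```

The drift line has slope (last − first) / (n − 1), which is exactly the "straight line through the first and last values" described above.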